Scaling in the structure of directory trees in a computer cluster 
We describe the topological structure and the underlying organization principles of the directories
created by users of a computer cluster when storing their own files. We analyze degree distributions,
average distance between files, distribution of community sizes and allometric scaling exponents of
the directory trees. We find that users create trees with a broad, scale-free degree distribution. The
structure of the directories is well captured by a growth model with a single parameter. The degree
distribution of the different trees has a non-universal exponent associated with different values of
the parameter of the model. However, the distribution of community sizes has a universal exponent
obtained analytically from our model.
The processes of storing and retrieving information are
rapidly gaining importance in science as well as in society as
a whole. Considerable effort is being undertaken,
first to characterize and describe how publicly
available information, for example on the World Wide Web,
is actually organized, and second to design efficient
methods to access this information. It seems clear that
to design methods for accessing information we first need
to know how information is actually stored or organized
as it is being produced.

Within this general framework, a crucial step in building
general knowledge of these processes is understanding
how each of us organizes the knowledge and information
we produce ourselves. To be specific, we pose
the question of general organizational principles in the
managing of our own electronic files. To answer this
question we analyze the structure and organization of 
the files stored in a computer cluster by the users of the 
computer facilities at a research institute. Within the 
general study of complex networks, we are here looking
at trees, and we report a first observation of the scale-free
property in trees. It is important to point out that we are
not studying a single large tree, but rather we are con- 
sidering a forest of many trees, each of them being the 
result of an individual construction. We are then able to 
consider samples of organizational schemes of many dif- 
ferent sizes, since each user has created a structure with a 
different number of directories. This allows the study of 
different samples of the same reality. We also note that 
contrary to other networks like the WWW or food webs, 
the structures considered here are not the outcome of a 
collective action but the creation of a single individual. 
Our research gives information about the management of 
information at the individual level. 

Two a priori possible answers to the question posed are
that we follow a random process of file storing or that,
on the contrary, we implement a carefully planned structure,
as we do when organizing the sections and chapters
of a PhD thesis or a scientific paper. What we find is
the signature of a complex system halfway between these
two possibilities, but still with well-defined patterns of
organization. In this paper we report an extensive characterization
of individual user computer directory trees,
calculating a number of quantitative measures. These
include degree distributions, average distance between
directories [3], distribution of community sizes
in the tree [9] and allometric scaling exponents [10].
Our data turn out to be well described by a directory attachment
model for constructing the tree. The model depends
on a single parameter q that interpolates between
random placement of new directories and agglomeration
into a star structure. The trees of the different
users are described by different values of the parameter
q: diversity in individual behavior here boils down to a
different value of a single parameter.

Data analysis - The data material under considera- 
tion is taken from the computer facilities of the Cross- 
disciplinary Physics Department of IMEDEA (Mediter- 
ranean Institute for Advanced Studies). The personal 
accounts of the 63 users running Linux and UNIX have 
been considered. The users include academic staff, post- 
docs, graduate students and long-time visitors. Each user 
is able to choose freely his/her own organizational scheme 
without specific software. The nodes in the directory tree 
of a given user are all directories (file folders) stored in 
the user's computer account. There is a direct link be- 
tween nodes i and j if directory i is a subdirectory of 
directory j or vice versa. We consider the trees as rooted 
with the home directory as the root. In the following, we 
analyze the trees in terms of the distributions of degree 
and of community sizes as well as the allometric scaling. 
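As an illustration of the tree definition above, the following minimal sketch builds the rooted directory tree of one account and computes the node degrees. It is not the procedure used for the data in this paper; the function names `directory_tree` and `degrees` and the parent-list representation are assumptions made for this example.

```python
import os

def directory_tree(root):
    """Return a parent list for the directory tree rooted at `root`.

    Node 0 is the root (parent -1). Every other entry holds the index of
    the parent directory. Only directories become nodes; files are ignored.
    Symbolic links to directories are listed as leaves but not followed.
    """
    root = os.path.abspath(root)
    index = {root: 0}
    parent = [-1]
    for path, subdirs, _files in os.walk(root, topdown=True, followlinks=False):
        here = index[path]
        for name in sorted(subdirs):
            index[os.path.join(path, name)] = len(parent)
            parent.append(here)
    return parent

def degrees(parent):
    """Degree of each node: number of subdirectories plus one link to the parent."""
    deg = [0] * len(parent)
    for child, p in enumerate(parent):
        if p >= 0:
            deg[child] += 1
            deg[p] += 1
    return deg
```

Because parents are always visited before their subdirectories, every node has a larger index than its parent, which makes later bottom-up passes over the list straightforward.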

A local measure of the importance of a given node
$i$ is the nodal degree $k_i$, which counts the number of nodes
directly connected to $i$. In a tree of $N$ nodes the average
degree is always $\langle k \rangle = 2 - 2/N$. The distribution of the
degree, however, varies strongly across different types of
structures. The distribution is narrow in simple chains
and binary trees, while it is broadest for a star (having
$N - 1$ nodes with degree $k = 1$ and one center node
with degree $k = N - 1$). The degree distributions of
FIG. 1: Scaling in the distributions of branching ratio (degree)
and sizes of the communities (subtrees). (a) Cumulative
degree distributions for the ten largest trees. The dashed
lines have slopes -1 and -2, indicating degree exponents between
$\gamma = 2$ and $\gamma = 3$. In the whole data set, however,
exponents $\gamma > 3$ have been observed as well. (b) Cumulative
distributions of community size plotted as in (a). The dashed
lines have slope -1, corresponding to community size exponent
$\tau = 2$. The overall cumulative distribution of the sizes of
the 16452 communities in all 63 directory trees (thick solid
curve) and the surrogate data from randomized trees (dot-dashed
curve) are shown as well. (c) Allometric scaling: Each
data point (small circle) shows cumulative community size $C$
(sum of sizes of all sub-communities) against the size $A$ of the
community itself. Logarithmic binning is applied to the original
data (large circles) and the surrogate data from randomized
trees (squares). The inset shows the binned original data
rescaled with $A$ (circles) and best fits for logarithm (solid line)
and power law (dotted curve). The surrogate data in (b) and
(c) are taken from 6300 trees, 100 trees obtained from each
original tree by independent random rewiring. Rewiring is
performed by iteratively swapping two randomly chosen node-disjoint
subtrees that do not contain the root. This standard
network randomization procedure [12], here applied to rooted
trees, conserves the degree distribution.



the observed directory trees (Fig. 1(a)) lie in between
these two extremes. Directory trees are scale-free: the
probability of finding a node with degree $k$ decays as a
power law $P(k) \sim k^{-\gamma}$ with a cut-off at the maximum degree
$k_{\max}$ due to finite size. There is no indication of an
upper bound on the degree that would limit the scaling
at large $k$. Given trees generated by different users, the
observed values of $\gamma$ do not coincide in general. The
degree exponent is not universal.
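The cumulative distributions shown in Fig. 1 can be computed as in the sketch below. For a power law with exponent $\gamma$ the cumulative distribution decays with exponent $\gamma - 1$, which is why slopes of -1 and -2 in the caption correspond to $\gamma = 2$ and $\gamma = 3$. The function name `cumulative_distribution` is an assumption of this example, not part of the original analysis.

```python
from collections import Counter

def cumulative_distribution(values):
    """Return sorted (x, P(X >= x)) pairs for a list of numbers."""
    counts = Counter(values)
    total = len(values)
    remaining = total          # number of samples with value >= current x
    result = []
    for x in sorted(counts):
        result.append((x, remaining / total))
        remaining -= counts[x]
    return result
```

Plotting the returned pairs on doubly logarithmic axes gives curves of the kind described in the caption of Fig. 1, for degrees as well as for community sizes.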

An alternative characterization of the trees is obtained
by iterative decomposition into subtrees rather than single
nodes. Here we consider the community structure of
the trees. For each node $i$, the community $S_i$ is the subtree
consisting of node $i$ and all nodes below $i$. In the
directory trees, a community $S_i$ is the tree formed by
a directory $i$, all its subdirectories, the subdirectories of
these, and so forth. A community $S_i$ is again a rooted tree
with node $i$ as the root. Calculating the sizes $A_i = |S_i|$
of all communities for each tree, we find the statistics in
Fig. 1(b). The distribution of community sizes decays as
a power law $A^{-\tau}$. The exponent $\tau = 2$ appears to be
universal. The scaling of community size $A$ is a property
independent of the scaling of the degree $k$. When
the trees are randomized while conserving the degrees of all
nodes, the functional form of the community size distribution
changes and acquires a scaling region with a larger
exponent $\tau > 2$.
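The community sizes defined above can be obtained in a single bottom-up pass. The sketch assumes the tree is given as a parent list in which node 0 is the root (parent -1) and every node has a larger index than its parent; this representation and the name `community_sizes` are choices made for the example.

```python
def community_sizes(parent):
    """Return A_i = |S_i|, the number of nodes in the subtree rooted at i.

    Assumes parent[0] == -1 and parent[i] < i for all i > 0, so that
    visiting nodes in decreasing index order handles children before parents.
    """
    size = [1] * len(parent)               # each community contains its own root
    for node in range(len(parent) - 1, 0, -1):
        size[parent[node]] += size[node]   # pass the finished subtree size upward
    return size
```

For a chain of four nodes, `community_sizes([-1, 0, 1, 2])` gives `[4, 3, 2, 1]`; for a star of four nodes, `community_sizes([-1, 0, 0, 0])` gives `[4, 1, 1, 1]`.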

In order to capture also the correlations between community
sizes we perform an allometric scaling analysis.
For each community $S_i$ we calculate the quantity
$C_i = \sum_{j \in S_i} A_j$, i.e. we sum up the sizes of all communities
contained in $S_i$, including $S_i$ itself. Figure 1(c) shows the
data point $(A_i, C_i)$ for each community $i$ in the 63 trees.
We find that the growth of $C$ with $A$ is superlinear.
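A sketch of how the pairs $(A_i, C_i)$ may be computed is given below, again assuming a parent list with the root at index 0 and parents preceding their children. The function name `allometric_pairs` is hypothetical. Two limiting cases are easy to check by hand: a chain of $A$ nodes has $C = A(A+1)/2$ at its root, while a star of $A$ nodes has $C = 2A - 1$.

```python
def allometric_pairs(parent):
    """Return a list of (A_i, C_i) for every node i of a rooted tree.

    A_i is the size of the community rooted at i; C_i is the sum of A_j
    over all nodes j in that community, including i itself.
    Assumes parent[0] == -1 and parent[i] < i for all i > 0.
    """
    n = len(parent)
    a = [1] * n
    for node in range(n - 1, 0, -1):
        a[parent[node]] += a[node]
    c = list(a)                          # C_i starts with the community's own size
    for node in range(n - 1, 0, -1):
        c[parent[node]] += c[node]       # add the cumulative sizes of the children
    return list(zip(a, c))
```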

Modeling.- Let us now consider a stochastic model for
the construction of a directory tree. We assume that
users build their trees by iteratively adding nodes, i.e.
creating new directories. Then for each possible tree the
model assigns an attachment probability to each of the
nodes. The attachment probability $\Pi_i$ of a node $i$ is
the probability that $i$ becomes the parent of the next
added node. In the simplest case, the structure of the
tree is irrelevant for the attachment process. Then we
have homogeneous attachment: each directory has the
same probability to become the parent of a new directory.
Another conceivable rule is copying of directories.
If directories are chosen for duplication with equal probability,
a directory obtains a new subdirectory with a
probability proportional to the number of subdirectories
it already has. Here we formulate a model comprising
both these mechanisms at a tunable ratio. In a tree with
$N$ nodes, a node with degree $k$ becomes the parent of the
next added node with probability

$$\Pi(k) = q\,\frac{k-1}{N-1} + (1-q)\,\frac{1}{N}. \qquad (1)$$
FIG. 2: Estimating the model parameter $q$ from the empirical
trees. (a) Independent estimates of $q$ from the moments $\langle k^n \rangle$
of the degree distribution coincide for each tree. Estimates
are plotted for $n = 2$ (diamonds), $n = 3$ (triangles), and
$n = 4$ (squares). Tree index reflects the ordering of the trees
with respect to the estimated $q$. (b) Comparing the $q$ values
estimated from the average path length and the third moment
of the degree for each tree. For all estimates in (a) and (b)
the following method is used. Given an empirical tree of size
$N$ with observable $x_{\mathrm{emp}}$, 10^ parameter values $q \in [0, 1]$ are
drawn from a uniform distribution. For each value drawn, an artificial
tree of size $N$ is generated by the model. The tree is accepted
if its value $x_{\mathrm{model}}$ of the considered observable does not differ
by more than 10% from the empirical value $x_{\mathrm{emp}}$. We take $\langle q \rangle$
as the average over parameter values of all accepted trees. The
range of an error bar in (b) indicates the standard deviation
of $q$ across the accepted trees.



The tunable parameter $q \in [0,1]$ is the probability that
duplication of a node is performed. With probability
$1 - q$ a randomly chosen node is the parent of the added
directory. Qualitatively, $q$ measures how often the individual
creating the tree likes to subdivide a directory.
Note that the pure rules ($q = 0$, $q = 1$) cannot produce
trees as in Fig. 1(a). Homogeneous attachment ($q = 0$)
leads to trees with an exponential degree distribution.
The pure duplication mechanism ($q = 1$) can only generate
stars because it cannot turn a leaf into an inner node.
By rewriting Eq. (1) as $\Pi(k) \propto k^{\mathrm{in}} + a$ with the number
of links $k^{\mathrm{in}} = k - 1$ received after creation of the node
and the "initial attractiveness" $a = 1/q - 1$, we see the
equivalence of our model with the network growth model
by Dorogovtsev et al. [13], restricted to a single new link
added per node. The case $q = 1/2$, giving $\gamma = 3$, is the
scale-free model by Barabási and Albert [14]. For general
$q \in [0, 1[$, the model produces scale-free trees with degree
exponent $\gamma = 2 + a = 1 + 1/q > 2$.
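The growth rule can be simulated in a few lines. The sketch below is one possible realization, not the authors' implementation: with probability $q$ it picks an existing non-root directory uniformly at random and attaches the new directory to that directory's parent, so a node is chosen as parent with probability proportional to its number of subdirectories; otherwise the parent is chosen uniformly among all nodes. The function name `grow_tree`, the parent-list output and the treatment of the very first step (where only the root exists) are assumptions of this example.

```python
import random

def grow_tree(n, q, rng=None):
    """Grow a rooted tree with n nodes; return its parent list (root has -1)."""
    rng = rng or random.Random()
    parent = [-1]
    for new in range(1, n):
        if new > 1 and rng.random() < q:
            # Copying: duplicate a uniformly chosen non-root node, i.e. the
            # new node becomes a sibling of it. A node with c subdirectories
            # is therefore selected as parent with probability c / (new - 1).
            target = parent[rng.randrange(1, new)]
        else:
            # Homogeneous attachment: every existing node is equally likely.
            target = rng.randrange(new)
        parent.append(target)
    return parent
```

With `q = 1.0` the sketch produces a star, and with `q = 0.0` it reduces to purely homogeneous attachment, in line with the two limiting cases discussed above.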

The evolution of community sizes is described by the
probability

$$\Pi(A) = q\,\frac{A-1}{N-1} + (1-q)\,\frac{A}{N} \approx \frac{A-q}{N} \qquad (2)$$

that the next node is attached to one of the nodes of
a given community of size $A$, thereby incrementing $A$.
From a continuous rate equation approach we obtain
$A_i(N) = (1-q)N/i + q$ as the expected size of community
$S_i$ in a tree of size $N$. The index $i$ is the time step of
creation of the community as a single node with $A = 1$.
The linear growth of $A$ with $N$ implies that the community
size distribution of the model decays asymptotically
as $A^{-\tau}$ with universal ($q$-independent) exponent $\tau = 2$,
in agreement with the data.
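The step from linear growth to the exponent can be spelled out. The following is a sketch under two simplifying assumptions that are not stated in the text: the expected size $A_i(N)$ is treated as the actual size, and the creation index $i$ is taken as uniformly distributed over $1, \dots, N$.

```latex
\begin{align}
P(A_i > A) &= P\!\left( i < \frac{(1-q)\,N}{A-q} \right) \simeq \frac{1-q}{A-q}, \\
p(A) &= -\frac{d}{dA}\, P(A_i > A) = \frac{1-q}{(A-q)^{2}} \sim A^{-2}
\qquad (A \gg 1).
\end{align}
```

The parameter $q$ enters only the prefactor and a shift of $A$, so the exponent of the density is 2 for any $q < 1$, and the cumulative distribution decays as $A^{-1}$, consistent with the slope -1 quoted for Fig. 1(b).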

For an estimate of the allometric scaling, first note the
general property $C_i = A_i + \sum_{j \in S_i} d_{ij}$, where the chemical
distance $d_{ij}$ is the number of nodes contained in
the direct path between nodes $i$ and $j$ (excluding $i$ itself). Adding a new
node $j^*$ to community $S_i$, the expected distance $\langle d_{ij^*} \rangle$
from node $i$ is $C_i/A_i - 1$ for copying and $C_i/A_i$ for homogeneous
attachment. Thus on average $C$ grows as
$dC/dA = 1 + C/A - q$, where the finite difference has
been approximated by the derivative and the index $i$ is
suppressed. For the initial condition $C(1) = 1$ we obtain
the solution $C(A) = A[(1 - q)\ln A + 1]$. The allometric
scaling of the model trees is linear with logarithmic
correction. In order to compare with the observed trees
we replot the binned data as $(A_i, C_i/A_i)$ in the inset of
Fig. 1(c). The data are captured well by a logarithmic
dependence (best fit $C/A = 0.59 \ln A + 0.99$, correlation
coefficient $r = 0.997$), in good agreement with the prediction
of the model.
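The quality of the continuum approximation can be checked numerically. The sketch below iterates the finite-difference version of the growth equation and compares it with the closed-form solution; the function names are hypothetical, and the value $q = 0.41$ in the usage note is chosen only because $1 - q$ then equals the fitted slope 0.59, not because it is a parameter reported in the text.

```python
import math

def iterate_allometry(q, a_max):
    """Iterate C(A+1) = C(A) + 1 + C(A)/A - q starting from C(1) = 1.

    Returns the list of ratios C(A)/A for A = 1, ..., a_max.
    """
    c = 1.0
    ratios = [1.0]
    for a in range(1, a_max):
        c += 1.0 + c / a - q
        ratios.append(c / (a + 1))
    return ratios

def closed_form_ratio(q, a):
    """C(A)/A from the continuum solution C(A) = A[(1 - q) ln A + 1]."""
    return (1.0 - q) * math.log(a) + 1.0
```

For example, with `ratios = iterate_allometry(0.41, 10000)`, the increase of `ratios` between $A = 100$ and $A = 10000$ divided by $\ln 100$ is close to $1 - q = 0.59$. The discrete iteration and the closed form differ by an offset that stays bounded as $A$ grows, so both show the same logarithmic correction.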

In order to provide a more stringent check of the validity
of the model (Eq. (1)) we first project the trees into a
space of four observables, namely the second, third and
fourth moments of the degree distribution and the average
chemical distance between nodes. For a given value
$x$ of an observable and given tree size $N$ we estimate the
most likely parameter value by weighting all possible
values $q \in [0,1]$ with the probability that they produce $x$
up to a small error. Figure 2 shows the results and gives
details of the method in the caption. For almost all trees
there is excellent agreement between the four parameter
estimates based on different observables. Thus, after
the choice of a single parameter, the model accurately reproduces
the projection of the trees into a four-dimensional
space. The projection takes into account the distribution
of the degree as a local property, and the average
distance $\langle d_{ij} \rangle$ between nodes as a global property. This
is strong evidence that the proposed growth mechanism
produces statistically the same structures as seen in the
directory trees.
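The estimation procedure described in the caption of Fig. 2 amounts to a rejection scheme, sketched below for a degree moment as the observable. This is an illustration under stated assumptions rather than the code used for the figure: the growth routine is one possible realization of the model, the number of draws (`draws=1000`) is an arbitrary choice for the example, and all function names are hypothetical.

```python
import random
import statistics

def grow_tree(n, q, rng):
    """One possible realization of the growth model; returns a parent list."""
    parent = [-1]
    for new in range(1, n):
        if new > 1 and rng.random() < q:
            parent.append(parent[rng.randrange(1, new)])  # copying step
        else:
            parent.append(rng.randrange(new))             # homogeneous step
    return parent

def degree_moment(parent, order):
    """Moment <k^order> of the degree distribution of a tree."""
    degree = [0] * len(parent)
    for child in range(1, len(parent)):
        degree[child] += 1
        degree[parent[child]] += 1
    return sum(k ** order for k in degree) / len(degree)

def estimate_q(n, x_emp, order=2, draws=1000, tolerance=0.10, seed=0):
    """Average q over model trees whose observable lies within 10% of x_emp.

    Returns (mean, standard deviation) of the accepted q values,
    or None if no tree was accepted.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        q = rng.random()                       # q drawn uniformly from [0, 1)
        x_model = degree_moment(grow_tree(n, q, rng), order)
        if abs(x_model - x_emp) <= tolerance * x_emp:
            accepted.append(q)
    if not accepted:
        return None
    return statistics.mean(accepted), statistics.pstdev(accepted)
```

Calling `estimate_q` with the size and measured moment of an empirical tree returns the mean and spread of the accepted parameter values, which correspond to the points and error bars described in the caption. Other observables, such as the average distance between nodes, can be substituted for `degree_moment` in the same scheme.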

Discussion - The structure of directory trees has been
characterized from a statistical point of view. Our main
result is the striking structural similarity between trees
created by independent users in the absence of common
constraints. Users create trees with a broad, scale-free
degree distribution with a non-universal exponent. The
distribution of community sizes, however, scales with a
universal exponent $\tau \approx 2$. The allometric scaling is linear
with a logarithmic correction. Community structure
and allometric scaling are significantly different in random
surrogate trees with the same degree distribution.
The statistical properties of the empirical trees are reproduced
by a model that generates trees by adding nodes iteratively.
The model has a single parameter $q$ controlling
the tendency to accumulate many subdirectories in the
same parent directory. By varying $q$, the degree exponent $\gamma$
can be tuned across the empirically observed range. The exponent
$\tau = 2$ and the allometric scaling $C \sim A \ln A$ have
been derived analytically and are independent of the parameter
$q$. The validity of the model has been evidenced
further by determining the most likely value of the parameter
$q$. For a given tree, estimates based on different
moments of the degree distribution as well as the diameter
coincide, while estimates vary across trees. Consequently,
directory trees can be distinguished by their
specific value of the growth parameter $q$.

A generally interesting question is to decide about universal
properties and universality classes of different natural
and artificial or man-made complex networks. The
community distribution exponent $\tau \approx 2$ that we find
for our directory trees is in agreement with the one reported
for the Internet [15, 16] and for the communities
of scientific collaborations. However, a different
class is formed by river networks [21], informal
networks in organizations [9] and jazz musician
networks [18], where the corresponding exponent takes
a value $\tau \approx 1.45$. These examples seem to belong
to the class of efficient networks obtained from an optimization
principle in which transportation costs are minimized.
For the class of efficient networks one can
prove [10, 22, 23] that allometric scaling is given by a
power law dependence $C \sim A^{\eta}$ with a universal exponent
$\eta = (D + 1)/D$, where $D$ is the embedding dimension.
In contrast to the prediction from efficiency, we
find $C \sim A \ln A$ for the directory trees, as reproduced by
our growth model. This result is also compatible with effective
(apparent) exponents observed in food webs [11].



We have shown that directory trees as individually 
man-made but not designed objects are an interesting 
direction of further research into hierarchical networks. 
Analyzing the wealth of readily available tree data on 
computers around the world offers improved insight into 
how people naturally structure information. 